Robust Data Partitioning for Ad-hoc Query Processing

Abstract

Data partitioning can significantly improve query performance in distributed database systems. Most proposed data partitioning techniques choose the partitioning based on a particular expected query workload or use a simple upfront scheme, such as uniform range partitioning or hash partitioning on a key. However, these techniques do not adequately address the case where the query workload is ad-hoc and unpredictable, as in many analytic applications. The Hyper-Partitioning system aims to fill that gap by using a novel space-partitioning tree on the space of possible attribute values to define partitions incorporating all attributes of a dataset. The system creates a robust upfront partitioning tree, designed to benefit all possible queries, and then adapts it over time in response to the actual workload. This thesis evaluates the robustness of the upfront hyper-partitioning algorithm, describes the implementation of the overall Hyper-Partitioning system, and shows how hyper-partitioning improves the performance of both selection and join queries.

Thesis Supervisor: Samuel Madden
Title: Professor
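To make the idea of a partitioning tree over several attributes more concrete, here is a minimal sketch. It is not the thesis's algorithm: it assumes a simple k-d-tree-style scheme that cycles through the attributes and cuts at the median, and all names (`Node`, `build`, `partitions_to_scan`) are hypothetical. It only illustrates how a tree that splits on every attribute lets range predicates on any attribute skip some partitions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Row = Tuple[float, ...]


@dataclass
class Node:
    attr: Optional[int] = None        # split attribute index (None on leaves)
    cut: Optional[float] = None       # rows with row[attr] <= cut go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    rows: Optional[List[Row]] = None  # set only on leaves (one partition)


def build(rows: List[Row], num_attrs: int, max_rows: int, depth: int = 0) -> Node:
    """Assumed scheme: round-robin over attributes, median cut, stop at max_rows."""
    if len(rows) <= max_rows:
        return Node(rows=list(rows))
    for offset in range(num_attrs):
        attr = (depth + offset) % num_attrs
        values = sorted(r[attr] for r in rows)
        cut = values[(len(values) - 1) // 2]  # lower median
        left = [r for r in rows if r[attr] <= cut]
        right = [r for r in rows if r[attr] > cut]
        if right:  # this attribute actually separates the rows
            return Node(attr, cut,
                        build(left, num_attrs, max_rows, depth + 1),
                        build(right, num_attrs, max_rows, depth + 1))
    return Node(rows=list(rows))  # rows are identical on every attribute


def partitions_to_scan(node: Node,
                       ranges: Dict[int, Tuple[float, float]]) -> List[List[Row]]:
    """Return the leaf partitions that may hold rows matching inclusive ranges."""
    if node.rows is not None:
        return [node.rows]
    lo, hi = ranges.get(node.attr, (float("-inf"), float("inf")))
    found: List[List[Row]] = []
    if lo <= node.cut:
        found += partitions_to_scan(node.left, ranges)
    if hi > node.cut:
        found += partitions_to_scan(node.right, ranges)
    return found


# Toy dataset: an 8 x 8 grid of (x, y) rows, split into partitions of 16 rows.
grid = [(float(x), float(y)) for x in range(8) for y in range(8)]
tree = build(grid, num_attrs=2, max_rows=16)
```

In this toy setup the tree has four partitions. A predicate on either attribute alone, for example `partitions_to_scan(tree, {0: (0, 1)})` or `partitions_to_scan(tree, {1: (5, 6)})`, needs only two of them, and a predicate on both attributes needs one. A partitioning on a single key would help only queries that filter on that key.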

Similar resources

Amoeba: A Shape changing Storage System for Big Data

Data partitioning significantly improves the query performance in distributed database systems. A large number of techniques have been proposed to efficiently partition a dataset for a given query workload. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not have a representative query workload upfront. Furthermore, workloads change over time as ...

Robust Data Transformations

Background. Massively parallel data processing systems are ubiquitous in today’s big data era. Examples include Hadoop, Spark, Stratosphere, and a number of tools developed on top of them. Users of these systems upload their datasets to a distributed file system and run their analysis in a distributed fashion. However, several analyses require a variety of data preparation steps in order to perf...

Data Warehouse Query Processing and Optimization Architecture

Data warehouse query processing must satisfy different requirements, such as simple or complex front-end ad hoc queries, queries used in applications (including data mining applications), and queries used to obtain information from metadata containing structured, unstructured, and semi-structured data such as XML (Extensible Markup Language) documents. In this paper, we will explain several robust algorithms...

An Adaptive Partitioning Scheme for Ad-hoc and Time-varying Database Analytics

Data partitioning significantly improves query performance in distributed database systems. A large number of techniques have been proposed to efficiently partition a dataset, often focusing on finding the best partitioning for a particular query workload. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not have a representative query workload. F...

SAMUEL: A Sharing-based Approach to processing Multiple SPARQL Queries with MapReduce

The volume of RDF data is now growing tremendously. It is thus considered prudent to store and process massive RDF data with distributed SPARQL engines instead of relying on a single-machine system. Many sophisticated index and partitioning schemes have also been proposed to support SPARQL query evaluations. However, existing SPARQL engines have mainly followed a one-at-a-time scheme so that query e...


Publication date: 2015